Spatial Autocorrelation

GESIS Workshop: Introduction to Geospatial Techniques for Social Scientists in R

Stefan Jünger & Dennis Abel

2025-04-10

Now

Day Time Title
April 09 10:00-11:30 Introduction
April 09 11:30-11:45 Coffee Break
April 09 11:45-13:00 Data Formats
April 09 13:00-14:00 Lunch Break
April 09 14:00-15:30 Mapping I
April 09 15:30-15:45 Coffee Break
April 09 15:45-17:00 Spatial Wrangling
April 10 09:00-10:30 Mapping II
April 10 10:30-10:45 Coffee Break
April 10 10:45-12:00 Applied Spatial Linking
April 10 12:00-13:00 Lunch Break
April 10 13:00-14:30 Spatial Autocorrelation
April 10 14:30-14:45 Coffee Break
April 10 14:45-16:00 Spatial Econometrics & Outlook

Thus far

We’ve done some wrangling, mapping, and linking of geospatial data (with georeferenced survey data)

We’ve seen that geospatial data are relevant for providing context (as social scientists, we know that space is important), and they are nice to look at: we can tell a story!

However, geospatial data can be interesting on their own for social science studies!

Tobler’s first law of geography

[E]verything is related to everything else, but near things are more related than distant things (Tobler 1970, p. 236)

This means nearby geographical regions, institutions, or people are more similar or have a stronger influence on each other.

What we get is an interdependent system.

Spatial Interdependence or Autocorrelation

Tobler’s law is the fundamental principle of spatial analysis. We want to know

  1. If observations in our data are spatially interdependent
  2. And how this interdependence can be explained (= data generation process)

Developing a model of connectedness: the chessboard

Rook and queen neighborhoods

It’s an interdependent system

Let’s do it hands-on: Our ‘research’ question

Say we are interested in AfD voting outcomes in relation to ethnic compositions of neighborhoods.

  • Combination of far-right voting research with Allport’s classic contact theory
  • We are restricting ourselves to the urban context of Cologne (again)

Voting districts

voting_districts <-
  sf::st_read("./data/Stimmbezirk.shp") |> 
  dplyr::transmute(Stimmbezirk = as.numeric(nummer))

head(voting_districts, 2)
Reading layer `Stimmbezirk' from data source 
  `C:\Users\mueller2\a_talks_presentations\gesis-workshop-geospatial-techniques-R-2025\data\Stimmbezirk.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 543 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 343914.7 ymin: 5632759 xmax: 370674.3 ymax: 5661475
Projected CRS: ETRS89 / UTM zone 32N
Simple feature collection with 2 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 354181.4 ymin: 5642934 xmax: 356401.7 ymax: 5644951
Projected CRS: ETRS89 / UTM zone 32N
  Stimmbezirk                       geometry
1       10205 MULTIPOLYGON (((354878.4 56...
2       10213 MULTIPOLYGON (((356057.1 56...

AfD vote share

afd_votes <-
  glue::glue(
    "https://www.stadt-koeln.de/wahlen/bundestagswahl/09-2021/praesentation/\\
    Open-Data-Bundestagswahl476.csv"
  ) |> 
  readr::read_csv2() |>
  dplyr::transmute(Stimmbezirk = `gebiet-nr`, afd_share = (F1 / F) * 100)

head(afd_votes, 2)
# A tibble: 2 × 2
  Stimmbezirk afd_share
        <dbl>     <dbl>
1       10101      13.1
2       10102      11.4
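The `election_results` object used in the following chunks combines both datasets. This step is not shown above; presumably, it is a plain ID join on the shared `Stimmbezirk` column, e.g.:

```r
# Assumed step (not shown on the slide): attach AfD vote shares to the
# voting district polygons via the shared `Stimmbezirk` ID
election_results <-
  voting_districts |>
  dplyr::left_join(afd_votes, by = "Stimmbezirk")
```

Since `voting_districts` is an sf object, the join preserves the geometry column, so the result can be mapped directly.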

Do vote shares spatially cluster?

ggplot() +
  geom_sf(
    data = election_results,
    aes(fill = afd_share)
    ) +
  scale_fill_viridis_c()

Pull in German Census data

immigrants_cologne <- terra::rast("./data/immigrants_cologne.tif")

inhabitants_cologne <- terra::rast("./data/inhabitants_cologne.tif")

immigrant_share_cologne <-
  (immigrants_cologne / inhabitants_cologne) * 100

It’s raster data

ggplot() +
  tidyterra::geom_spatraster(
    data = immigrant_share_cologne,
    aes(fill = immigrants_cologne)
    ) +
  scale_fill_viridis_c()

Linking: Let’s get geographical

Since the voting (vector) data and the Census raster data come in different formats, we cannot use simple ID matching as before.

  • We have to rely on spatial linking techniques
  • We could use terra::extract()
    • But as a default, it only captures raster cells as a whole and not their spatial fraction
    • Which is honestly okay for most applications
    • But why not try something else?

exactextractr::exact_extract()!

election_results <-
  election_results |>
  dplyr::mutate(
    immigrant_share = 
      exactextractr::exact_extract(
        immigrant_share_cologne, election_results, 'mean', progress = FALSE
      ),
    inhabitants = 
      exactextractr::exact_extract(
        inhabitants_cologne, election_results, 'mean', progress = FALSE
      )
  )

head(election_results, 2)
Simple feature collection with 2 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 354181.4 ymin: 5642934 xmax: 356401.7 ymax: 5644951
Projected CRS: ETRS89 / UTM zone 32N
  Stimmbezirk afd_share                       geometry immigrant_share
1       10205   9.74026 MULTIPOLYGON (((354878.4 56...        15.00740
2       10213  11.76471 MULTIPOLYGON (((356057.1 56...        10.66915
  inhabitants
1    140.1017
2    107.6079

Voilà

ggplot() +
  geom_sf(
    data = election_results,
    aes(fill = immigrant_share)
    ) +
  scale_fill_viridis_c()

How to test spatial autocorrelation

We now have to ask

  • Do the spatial units relate to each other?
  • If yes, in which way?
    • Only if they border each other? (i.e., queen or rook neighborhoods)
    • Or also if they are in proximity but not necessarily contiguous?

Let’s try Queens neighborhoods

queens_neighborhoods <-
  spdep::poly2nb(
    election_results,
    queen = TRUE
  )

summary(queens_neighborhoods)
Neighbour list object:
Number of regions: 543 
Number of nonzero links: 3120 
Percentage nonzero weights: 1.058169 
Average number of links: 5.745856 
Link number distribution:

  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
  1   9  48  89 137  97  70  43  19  17   4   5   2   1   1 
1 least connected region:
69 with 1 link
1 most connected region:
387 with 15 links

Connected regions

library(tmap) # for tm_shape(), tm_dots(), and tm_lines()

queens_neighborhoods |>
  spdep::nb2lines(
    coords = sf::st_as_sfc(election_results), 
    as_sf = TRUE
  ) |>
  tm_shape() +
  tm_dots() +
  tm_lines()

Can we now start?

Unfortunately, we are not yet done with creating the links between neighborhoods. What we get is, in principle, a huge matrix indicating which observations are connected.

   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
1     0    0    0    0    0    0    0    0    0     0
2     0    0    0    0    0    0    0    1    1     0
3     0    0    0    1    1    1    1    0    1     0
4     0    0    1    0    1    1    1    0    1     0
5     0    0    1    1    0    1    0    0    0     0
6     0    0    1    1    1    0    1    1    1     0
7     0    0    1    1    0    1    0    1    1     0
8     0    1    0    0    0    1    1    0    1     0
9     0    1    1    1    0    1    1    1    0     0
10    0    0    0    0    0    0    0    0    0     0

That is nothing we could plug directly into a statistical model, such as a regression (see next session).

Normalization

Normalization is the process of creating actual spatial weights. There is considerable dispute about how to do it (Neumayer & Plümper, 2016). However, nobody questions whether it should be done in the first place since, among other benefits, it restricts the parameter space of the weights.

  [,1] [,2] [,3] [,4] [,5]
1    0    0    0    0    0
2    0    0    0    0    0
3    0    0    0    1    1
4    0    0    1    0    1
5    0    0    1    1    0
[1] 0 0 2 2 2
  [,1] [,2] [,3]       [,4]       [,5]
1    0    0  0.0 0.00000000 0.00000000
2    0    0  0.0 0.00000000 0.00000000
3    0    0  0.0 0.08333333 0.08333333
4    0    0  0.2 0.00000000 0.20000000
5    0    0  0.2 0.20000000 0.00000000
[1] 0.0000000 0.0000000 0.1666667 0.4000000 0.4000000

Row-normalization

One of the disputed but, at the same time, standard procedures is row-normalization. It divides each individual weight (= connection between spatial units) \(w_{ij}\) by the sum of all weights in its row:

\[w_{ij}^* = \frac{w_{ij}}{\sum_j w_{ij}}\]

An alternative would be minmax-normalization:

\[w_{ij}^* = \frac{w_{ij} - \min(w_{ij})}{\max(w_{ij}) - \min(w_{ij})}\]
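As a sanity check, row-normalization can be reproduced by hand on a small binary contiguity matrix. This is a sketch with made-up data (the neighborhood pattern is invented for illustration):

```r
# Made-up binary contiguity matrix: units 3, 4, 5 are mutual neighbors;
# units 1 and 2 are isolates without any neighbors
B <- matrix(0, 5, 5)
B[3, 4] <- B[4, 3] <- 1
B[3, 5] <- B[5, 3] <- 1
B[4, 5] <- B[5, 4] <- 1

# Row-normalization: divide each weight by its row sum
# (guard against division by zero for the isolated units)
rs <- rowSums(B)
W <- B / ifelse(rs == 0, 1, rs)

rowSums(W)  # 0 0 1 1 1: each connected row's weights now sum to one
```

This mirrors what `spdep::nb2listw(..., style = "W")` does internally for row-standardized weights.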

Apply row-normalization

queens_W <- spdep::nb2listw(queens_neighborhoods, style = "W")

summary(queens_W)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 543 
Number of nonzero links: 3120 
Percentage nonzero weights: 1.058169 
Average number of links: 5.745856 
Link number distribution:

  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
  1   9  48  89 137  97  70  43  19  17   4   5   2   1   1 
1 least connected region:
69 with 1 link
1 most connected region:
387 with 15 links

Weights style: W 
Weights constants summary:
    n     nn  S0       S1       S2
W 543 294849 543 201.1676 2261.458

Test of spatial autocorrelation: Moran’s I

\[I=\frac{N}{\sum_{i=1}^N\sum_{j=1}^Nw_{ij}}\frac{\sum_{i=1}^{N}\sum_{j=1}^Nw_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_{i=1}^N(x_i-\bar{x})^2}\]

First and foremost, Moran’s I uses the previously created weights between all pairs of spatial units \(w_{ij}\). It weights deviations from the overall mean of connected pairs according to the strength of the modeled spatial relations. Moran’s I can be interpreted as a correlation coefficient (-1 = perfect negative spatial autocorrelation; +1 = perfect positive spatial autocorrelation).
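To make the formula concrete, here is a minimal base-R sketch on made-up data: four units arranged in a chain (1-2-3-4) with binary contiguity weights. All numbers are invented for illustration:

```r
# Four units in a chain (1-2-3-4) with binary contiguity weights
x <- c(2, 4, 6, 8)
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1  # forward links
W <- W + t(W)            # make the matrix symmetric

n  <- length(x)
xc <- x - mean(x)

# Moran's I, term by term as in the formula above
I <- (n / sum(W)) * sum(W * outer(xc, xc)) / sum(xc^2)
I  # 0.333...: positive spatial autocorrelation, as neighbors have similar values
```

`sum(W * outer(xc, xc))` computes \(\sum_i\sum_j w_{ij}(x_i-\bar{x})(x_j-\bar{x})\) in one vectorized step.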

Moran’s I in spdep

spdep::moran.test(
  election_results$immigrant_share, 
  listw = queens_W
)

    Moran I test under randomisation

data:  election_results$immigrant_share  
weights: queens_W    

Moran I statistic standard deviate = 20.897, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.5398961097     -0.0018450185      0.0006720411 

Test of spatial autocorrelation: Geary’s C

Moran’s I is a global statistic for spatial autocorrelation. It can produce issues when there are only local clusters of spatial interdependence in the data. An alternative is the use of Geary's C:

\[C=\frac{(N-1)\sum_i\sum_jw_{ij}(x_i-x_j)^2}{2\sum_{i=1}^N\sum_{j=1}^Nw_{ij}\sum_i(x_i-\bar{x})^2}\]

As you can see, in the numerator, the average value \(\bar{x}\) is not as prominent as in Moran’s I. Geary’s C only produces values between 0 and 2 (value near 0 = positive spatial autocorrelation; 1 = no spatial autocorrelation; values near 2 = negative spatial autocorrelation).
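Geary’s C can also be computed by hand. The sketch below reuses the made-up four-unit chain from the Moran’s I example; note that the numerator is built from squared differences between neighboring values rather than cross-products of deviations:

```r
# Same made-up toy chain of four units as in the Moran's I sketch
x <- c(2, 4, 6, 8)
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W <- W + t(W)

n  <- length(x)
xc <- x - mean(x)

# Geary's C: squared differences between neighbors in the numerator
num <- (n - 1) * sum(W * outer(x, x, FUN = function(a, b) (a - b)^2))
den <- 2 * sum(W) * sum(xc^2)
C <- num / den
C  # 0.3: near 0, i.e., positive spatial autocorrelation
```

On the same data, Moran’s I is 1/3 and Geary’s C is 0.3, both pointing to positive spatial autocorrelation, as expected for a smoothly increasing chain.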

Geary’s C in spdep

spdep::geary.test(
  election_results$immigrant_share, 
  listw = queens_W
)

    Geary C test under randomisation

data:  election_results$immigrant_share 
weights: queens_W   

Geary C statistic standard deviate = 16.951, p-value < 2.2e-16
alternative hypothesis: Expectation greater than statistic
sample estimates:
Geary C statistic       Expectation          Variance 
     0.4649079513      1.0000000000      0.0009965167 

Modern interface: the sfdep package

The sfdep package provides a more tidyverse-compliant syntax for creating and handling spatial weights. See:

election_results <-
  election_results |> 
  dplyr::mutate(
    neighbors = sfdep::st_contiguity(election_results), # queen neighborhoods by default
    weights = sfdep::st_weights(neighbors)
  )

head(election_results, 2)
Simple feature collection with 2 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 354181.4 ymin: 5642934 xmax: 356401.7 ymax: 5644951
Projected CRS: ETRS89 / UTM zone 32N
  Stimmbezirk afd_share                       geometry immigrant_share
1       10205   9.74026 MULTIPOLYGON (((354878.4 56...        15.00740
2       10213  11.76471 MULTIPOLYGON (((356057.1 56...        10.66915
  inhabitants                          neighbors
1    140.1017 16, 18, 22, 25, 458, 459, 460, 462
2    107.6079                   8, 9, 11, 48, 49
                                                 weights
1 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125
2                                0.2, 0.2, 0.2, 0.2, 0.2

Calculating Moran’s I once again

library(magrittr)

election_results %$% 
  sfdep::global_moran_test(immigrant_share, neighbors, weights)

    Moran I test under randomisation

data:  x  
weights: listw    

Moran I statistic standard deviate = 20.897, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.5398961097     -0.0018450185      0.0006720411 

Calculating Geary’s C once again

election_results %$% 
  sfdep::global_c_test(immigrant_share, neighbors, weights)

    Geary C test under randomisation

data:  x 
weights: listw   

Geary C statistic standard deviate = 16.951, p-value < 2.2e-16
alternative hypothesis: Expectation greater than statistic
sample estimates:
Geary C statistic       Expectation          Variance 
     0.4649079513      1.0000000000      0.0009965167 

Exercise 7: Neighborhood Matrices

Exercise

Measures of local spatial autocorrelation: LISA clusters

We show you the sfdep package because it provides nice functions to calculate local measures of spatial autocorrelation. One popular choice is the estimation of Local Indicators of Spatial Autocorrelation (i.e., LISA clusters). Most straightforwardly, they can be interpreted as case-specific indicators of spatial autocorrelation:

\[I_i=\frac{x_i-\bar{x}}{\sum_{i=1}^N(x_i-\bar{x})^2/N}\sum_{j=1}^Nw_{ij}(x_j-\bar{x})\]
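The local values decompose the global statistic: summing the \(I_i\) and dividing by \(\sum_{ij} w_{ij}\) recovers global Moran’s I. A base-R sketch on the same made-up four-unit chain used earlier:

```r
# Local Moran's I by hand on the made-up toy chain from before
x <- c(2, 4, 6, 8)
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W <- W + t(W)

n  <- length(x)
xc <- x - mean(x)
m2 <- sum(xc^2) / n  # variance term in the denominator

# One local value I_i per observation; W %*% xc gives the spatial lag
Ii <- (xc / m2) * as.vector(W %*% xc)
Ii  # 0.6 0.4 0.4 0.6: every unit contributes positively

# Local values aggregate to the global statistic
sum(Ii) / sum(W)  # 0.333..., the global Moran's I from before
```

Units at the ends of the chain (with the most extreme deviations) receive the largest local values, which is exactly the case-specific reading the text describes.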

Local Moran’s I in sfdep

lisa <- 
  election_results |> 
  dplyr::mutate(
    lisa = sfdep::local_moran(afd_share, neighbors, weights)
  ) |>
  tidyr::unnest()

head(lisa, 2)
Simple feature collection with 2 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 354181.4 ymin: 5644198 xmax: 354878.4 ymax: 5644951
Projected CRS: ETRS89 / UTM zone 32N
# A tibble: 2 × 19
  Stimmbezirk afd_share                     geometry immigrant_share inhabitants
        <dbl>     <dbl>           <MULTIPOLYGON [m]>           <dbl>       <dbl>
1       10205      9.74 (((354878.4 5644810, 354864…            15.0        140.
2       10205      9.74 (((354878.4 5644810, 354864…            15.0        140.
# ℹ 14 more variables: neighbors <int>, weights <dbl>, ii <dbl>, eii <dbl>,
#   var_ii <dbl>, z_ii <dbl>, p_ii <dbl>, p_ii_sim <dbl>, p_folded_sim <dbl>,
#   skewness <dbl>, kurtosis <dbl>, mean <fct>, median <fct>, pysal <fct>

It’s also nice for mapping

ggplot() +
  geom_sf(
    data = lisa,
    aes(fill = ii)
  ) +
  scale_fill_viridis_c()

One last bit: bivariate local Moran’s I

lisa_bivariate <- 
  election_results |> 
  dplyr::mutate(
    lisa = sfdep::local_moran_bv(
      afd_share, 
      immigrant_share, 
      neighbors, 
      weights
      )
  ) |> 
  tidyr::unnest()

ggplot() +
  geom_sf(
    data = lisa_bivariate,
    aes(fill = Ib)
  ) +
  scale_fill_viridis_c()

Wrap up

You now know how to model the connectedness of spatial units, investigate spatial autocorrelation globally and locally, and map it.

There’s way more, particularly regarding spatial weights (see exercise), clustering techniques (e.g., Hot Spot Analysis), or autocorrelation with more than one or two variables.

Now we know that our data are spatially autocorrelated. Let’s try to find out why that is the case with some spatial econometrics.